For the evaluation of different tube-kit combinations in the second phase of exRNAQC, blood was drawn from 5 healthy volunteers. We tested 2 different kits in combination with 3 different tubes that delivered the best results in phase 1:
Kits:
Tubes:
The metrics that were used in phase 1 to compare kits (exRNA004) and tubes (exRNA005) are repeated here, this includes:
In order to compare tube-kit combinations, five performance metrics were evaluated. We calculated for every combination the fold change across different timepoints (0h vs 4h and 0h vs 16h) and the z-scores. Before z-score transformation, we made sure that a higher value always corresponds to better performance. To account for the low sample size, we calculated robust z-scores.
Thus, 5 fold-change values and Z-scores were obtained per combination. These were visualized and tube-kit combinations were ordered based on the mean of these values.
For every metric evaluated here, we tested the possibility of interaction between tube, RNA isolation kit and/or time. This analysis can be found in a seperate document: “Repeated measures analyses”.
Sample annotation with info about tube, kit, used input volume, eluate volume etc., volume unit is µL.
Hemolysis was measured with Nanodrop (absorbance at 414 nm) in plasma across all tubes. Hemolysis was below 0.2 in all samples, differences between tubes are therefore negligible.
ggplot(samples,aes(x=TimeLapse,y=Hemolysis, col=donorID, group=donorID))+
geom_point()+geom_line(aes(group = donorID))+
theme_point+
facet_wrap(~Tube, ncol = 3, scales="free_x") +
theme(panel.grid.major.x=element_line(linetype = "dashed",color="gray88"), axis.text.x=element_text(size=8)) +
scale_y_continuous(limits = c(0, NA))+
scale_color_manual(values=color_panel) +
theme(axis.text.x=element_text(angle=90, vjust=0.5, hjust=1)) + labs(title="hemolysis in plasma - split by tube", y = "abs at 414 nm", col = NULL, x = NULL)
#ggsave("./data_output/plots/hemolysis_PFP_lineplot_byTube.pdf", plot = p1, dpi = 300)
ggplot(samples,aes(x=TimeLapse,y=Hemolysis, col=Tube, group = Tube))+
geom_point()+geom_line()+
theme_point+
facet_wrap(~donorID, ncol = 5, scales="free_x") +
theme(panel.grid.major.x=element_line(linetype = "dashed",color="gray88"), axis.text.x=element_text(size=8)) +
scale_y_continuous(limits = c(0, NA))+
scale_color_manual(values=color_panel) +
theme(axis.text.x=element_text(angle=90, vjust=0.5, hjust=1)) + labs(title="hemolysis in plasma - split by donor", y = "abs at 414 nm", col = NULL, x = NULL)
#ggsave("./data_output/plots/hemolysis_PFP_lineplot_byDonor.pdf", plot = p1, dpi = 300)
Hemolysis change between timpoints, 1st timepoint: 0h vs 4h, 2nd timepoint: 0h vs 16h.
tmp_summary <- samples %>% group_by(GraphTube) %>% dplyr::summarize(mean_Hemolysis= mean(Hemolysis), sd_Hemolysis=sd(Hemolysis)) %>% mutate(CV_Hemolysis = sd_Hemolysis/mean_Hemolysis*100) %>% left_join(unique(dplyr::select(samples, c("Tube","GraphTube"))), by="GraphTube")
FC <- generateFC_tubes(input_df = samples, variable = "Hemolysis", dfName = "hemolysis")
names(FC)[names(FC) == 'foldchange'] <- 'hemolysis_FC'
#ggsave("./data_output/plots/hemolysis_FC.pdf", plot = ggplot2::last_plot(), dpi = 300)
hemolysis_FC_plot <- ggplot2::last_plot()
samples_comb_exRNAQC005 <- rbind(samples, samples_exRNAQC005)
ggplot(samples_comb_exRNAQC005,aes(x=TimeLapse,y=Hemolysis, col=BasecampID, group=donorID))+
geom_point()+geom_line(aes(group = donorID))+
theme_point+
facet_wrap(~Tube, ncol = 3, scales="free_x") +
theme(panel.grid.major.x=element_line(linetype = "dashed",color="gray88"), axis.text.x=element_text(size=8)) +
scale_color_discrete(name = "study ID") +
scale_y_continuous(limits = c(0, NA)) +
theme(axis.text.x=element_text(angle=90, vjust=0.5, hjust=1)) + labs(title="hemolysis in plasma - split by tube", y = "abs at 414 nm", col = NULL, x = NULL)
reads_pre = read.table(paste0(data_path, "lines_fastq_pre.txt"),header=FALSE, col.names = c("pre", "RNAID"))
reads_qc = read.table(paste0(data_path, "lines_fastq_qc.txt"),header=FALSE, col.names = c("qc", "RNAID"))
reads_subs = read.table(paste0(data_path, "lines_fastq_subs.txt"),header=FALSE, col.names = c("subs", "RNAID"))
reads_post = read.table(paste0(data_path, "lines_fastq_post.txt"),header=FALSE, col.names = c("post", "RNAID"))
reads <- full_join(reads_pre,reads_qc,by=c("RNAID"))
reads <- full_join(reads,reads_subs,by=c("RNAID"))
reads <- full_join(reads,reads_post,by=c("RNAID"))
reads_melt <- reads %>% melt(value.name = "reads")
reads$duplicates <- round(1 - reads$post/reads$subs,4)
reads <- reads %>% dplyr::select(RNAID, duplicates, post, subs, qc, pre)
reads_melt <- full_join(reads_melt, samples, by = c("RNAID"))
reads_melt_pre <- reads_melt %>% filter(variable == "pre")
reads_melt_pre <- reads_melt_pre %>% mutate(reads_pct=round(reads/sum(reads)*100,2))
#write.csv(reads_melt_pre, file='./data_output/reads_melt_pre.csv')
ggplot(reads_melt_pre,aes(x=SampleID,y=reads_pct, color = Tube, group=Tube, shape=RNAisolation))+
geom_point(size=2.5)+
ylim(0,2) + theme_bar+ theme(axis.text.x=element_blank()) +
labs(y = "percentage reads per sample", title="basespace indexing %", col = NULL)+
geom_hline(yintercept = 0.5, linetype = "dashed", color = "gray") +
scale_color_manual(values=color_panel) +
scale_shape_manual(values = shape_panel3)
Indexing QC
The reads were counted on the fastq level and plotted in the following plots. We used these commands to count the reads in the fastq files:
for sample in $(ls | grep "RNA0"); do lines=4; qcfil_lines=`zcat ${sample}/${sample}_1.fastq.gz | wc -l`; echo $((qcfil_lines/lines)) $(echo $sample | cut -d'-' -f1); done >> lines_fastq_pre.txt
for sample in $(ls | grep "RNA0"); do lines=4; qcfil_lines=`zcat ${sample}/dedup_clumpify-subs_atstart/${sample}_1_qc_subs.fastq.gz | wc -l`; echo $((qcfil_lines/lines)) $(echo $sample | cut -d'-' -f1); done >> lines_fastq_subs.txt
for sample in $(ls | grep "RNA0"); do lines=4; qcfil_lines=`zcat ${sample}/dedup_clumpify-subs_atstart/${sample}_1_qc_subs_clumped.fastq.gz | wc -l`; echo $((qcfil_lines/lines)) $(echo $sample | cut -d'-' -f1); done >> lines_fastq_post.txt
Quality filtering of sequenced reads (keep PE reads were at least 80% of bases in both reads have a phred score ≥ 20) Amount of paired reads after filtering: min= 14 951 681, mean= 37 728 539, max= 83 333 545
Randomly downsample everything to the lowest number of paired end reads (at FASTQ level) to make sure the comparison of metrics is fair. E.g. if one sample is sequenced deeper it is likely to yield more genes compared to a sample that was sequenced less deep.
As the lowest number of reads is 14 951 681, we downsampled all samples to 14M paired end reads.
pd <- position_dodge(0.2)
# reads_melt_T0 <- reads_melt %>% filter(TimeLapse == "T0")
#
# ggplot(reads_melt_T0,aes(x=reorder(variable, desc(variable)),y=reads, group = RNAID, col = RNAisolation, shape = GraphTube))+
# geom_line(position = pd)+
# geom_point(position = pd) +
# scale_y_continuous(trans = "log10") +
# labs(y = "# reads", x = "stage in pipeline", title = "T0") +
# theme_bar+ theme(axis.text.x=element_text(angle=90, hjust=1)) +
# scale_y_continuous(limits = c(0,100000000))+
# scale_x_discrete(labels = c("raw read count","read count after QC","read count after subsampling", "reads count after pipeline"), limits=c("pre", "qc", "subs", "post"))+
# scale_color_manual(values=color_panel2)+
# scale_shape_manual(values=shape_panel2)
#
# reads_melt_T04 <- reads_melt %>% filter(TimeLapse == "T04")
#
# ggplot(reads_melt_T04,aes(x=reorder(variable, desc(variable)),y=reads, group = RNAID, col = RNAisolation, shape = GraphTube))+
# geom_line(position = pd)+
# geom_point(position = pd) +
# scale_y_continuous(trans = "log10") +
# labs(y = "# reads", x = "stage in pipeline", title = "T4") +
# theme_bar+ theme(axis.text.x=element_text(angle=90, hjust=1)) +
# scale_y_continuous(limits = c(0,100000000))+
# scale_x_discrete(labels = c("raw read count","read count after QC","read count after subsampling", "reads count after pipeline"), limits=c("pre", "qc", "subs", "post"))+
# scale_color_manual(values=color_panel2)+
# scale_shape_manual(values=shape_panel2)
#
# reads_melt_T16 <- reads_melt %>% filter(TimeLapse == "T16")
# ggplot(reads_melt_T16,aes(x=reorder(variable, desc(variable)),y=reads, group = RNAID, col = RNAisolation, shape = GraphTube))+
# geom_line(position = pd)+
# geom_point(position = pd) +
# scale_y_continuous(trans = "log10") +
# labs(y = "# reads", x = "stage in pipeline", title = "T16") +
# theme_bar+ theme(axis.text.x=element_text(angle=90, hjust=1)) +
# scale_y_continuous(limits = c(0,100000000))+
# scale_x_discrete(labels = c("raw read count", "read count after QC", "read count after subsampling", "reads count after pipeline"), limits=c("pre", "qc", "subs", "post"))+
# scale_color_manual(values=color_panel2)+
# scale_shape_manual(values=shape_panel2)
ggplot(reads_melt,aes(x=reorder(variable, desc(variable)),y=reads, group = RNAID, col = RNAisolation, shape = RNAisolation))+
geom_line(position = pd)+
geom_point(position = pd) +
facet_wrap(~Tube)+
labs(y = "# reads", x = "stage in pipeline", title = "Number of reads") +
theme_bar+ theme(axis.text.x=element_text(angle=90, hjust=1)) +
scale_y_continuous(trans = "log10")+
scale_x_discrete(labels = c("raw read count","read count after QC","read count after subsampling", "reads count after pipeline"), limits=c("pre", "qc", "subs", "post"))+
scale_color_manual(values=color_panel2)+
scale_shape_manual(values=shape_panel3)
#p_all <- ggarrange(p1 + labs(title="T0"),p2 + labs(title="T04"),p3 + labs(title="T16"), common.legend = TRUE, nrow = 3, legend = "bottom")
#ggsave(plot = ggplot2::last_plot(), filename = file.path(output_path, "plots", "reads.pdf"), useDingbats=F)
# Reads should be subsampled to: reads_melt %>% filter(variable == "pre") %>% select(reads) %>% min()
# Number of reads compared with phase 1
#kit_reads = read.table("./data_raw_exRNAQC004/lines_fastq_clumpify.txt", header=TRUE) %>%
# dplyr::rename(pre=orig_lines,qc=qcfil_lines,subs=subs_lines,post=clump_lines)%>%
# separate(col = samplename, c("UniqueID","number"), "-")%>%
# select(-number)
#kit_reads_melt <- kit_reads %>% melt(value.name = "reads")
#kit_reads$duplicates <- round(1 - kit_reads$post/kit_reads$subs,4)
#kit_reads_melt <- merge(kit_reads, samples_comb, by = c("UniqueID"))
#tube_reads
#reads_melt_T0 <- reads_melt %>% filter(TimeLapse == "T0")
kallisto_aligned <- read_tsv("./data_raw/multiqc_kallisto.txt")
kallisto_aligned$`Sample` <- gsub("-.*", "", kallisto_aligned$`Sample`)
kallisto_aligned <- merge(samples, kallisto_aligned, by.x = "RNAID", by.y = "Sample")
kallisto_aligned$percent_aligned <- round(kallisto_aligned$percent_aligned, 2)
#write.csv(kallisto_aligned, file='./data_output/kallisto_aligned.csv')
ggplot(kallisto_aligned,aes(x=tube_kit_timeID,y=percent_aligned, group = PlasmaID, shape = RNAisolation, colour = Tube))+
geom_point()+
coord_flip()+
labs(x="",y="% alignment", col = "donorID")+
scale_colour_manual(values=color_panel)+
scale_shape_manual(values=shape_panel3)+
theme_point +
scale_y_continuous(limits=c(0,100)) +
theme(panel.grid.major.x = element_line(colour = "lightgrey", linetype = "dashed", size=0.2))+
labs(title = "pseudoalignment with Kallisto", col = NULL)
#ggsave("./data_output/plots/alignment_percentage.pdf", plot = ggplot2::last_plot(), dpi = 300)
#reads_overview <- full_join(reads_overview, kallisto_aligned %>% select("percent_aligned", "UniqueID"), by = c("UniqueID"))
As Clumpify (version bundled with BBMap version 38.28) does not output a log file with duplicate stats, we estimated duplicate reads by dividing the number of reads after Clumpify by the number of reads after subsampling (as this is the only step that removes reads, so all removed reads are duplicates). As expected because of the low RNA input, duplicate percentages range from 82-98%.
duplicates <- full_join(reads, kallisto_aligned, by = c("RNAID"))
ggplot(duplicates,aes(x=tube_kit_timeID,y=duplicates*100, group = PlasmaID, shape = RNAisolation, color=Tube))+
scale_shape_manual(values = shape_panel3) +
geom_point()+
labs(x="",y="% duplication", col = "tube")+
scale_colour_manual(values=color_panel)+
theme_point +
scale_y_continuous(limits=c(NA,100)) +
coord_flip()+
theme(panel.grid.major.x = element_line(colour = "lightgrey", linetype = "dashed", size=0.2))+
labs(title = "duplicate rate with Clumpify", col = NULL)
#ggsave("./data_output/plots/duplicate_percentage.pdf", plot = ggplot2::last_plot(), dpi = 300)
ggplot(duplicates,aes(x=tube_kit_timeID,y=post, group = PlasmaID, shape = RNAisolation, color=Tube))+
scale_shape_manual(values = shape_panel3) +
geom_point()+
labs(x="",y="reads remaining", title = "duplicate rate with Clumpify (abs. # reads)", col = NULL)+
scale_colour_manual(values=color_panel)+
theme_point +
scale_y_continuous(limits=c(NA,NA)) +
coord_flip()+
theme(panel.grid.major.x = element_line(colour = "lightgrey", linetype = "dashed", size=0.2))
#ggsave("./data_output/plots/duplicate_counts_after_dup.pdf", plot = ggplot2::last_plot(), dpi = 300)
tmp_summary_dupl <- duplicates %>% group_by(GraphTube) %>% dplyr::summarize(mean_duplicates= mean(duplicates), sd_duplicates=sd(duplicates)) %>% mutate(CV_duplicates = sd_duplicates/mean_duplicates*100) %>% left_join(unique(dplyr::select(duplicates, c("Tube","GraphTube"))), by="GraphTube")
FC <- generateFC(input_df = duplicates, variable = "duplicates", dfName = "duplicates")
names(FC)[names(FC) == 'foldchange'] <- 'duplicates_FC'
FC_all <- FC %>% dplyr::select(-inputType)
#duplicates_FC_plot <- ggplot2::last_plot()
#ggsave("./data_output/plots/hemolysis_FC.pdf", plot = ggplot2::last_plot(), dpi = 300)
A strand specific protocol was used. To test if this worked as expected, we used RSeQC on BAM output files after STAR alignment to infer strandedness:
strandedness = read.table(paste0(data_path, "RSeQC_output_nodedup.txt"),header=F)
#cat */dedup_clumpify-subs_atstart/RSeQC_output.txt > RSeQC_output_nodedup.txt
colnames(strandedness) = c("sample_name","strand")
strandedness$sample_name <- gsub("-.*$","",strandedness$sample_name)
strandedness <- merge(duplicates, strandedness, by.x = "RNAID", by.y = "sample_name")
ggplot(strandedness,aes(x=tube_kit_timeID,y=100*strand, group = PlasmaID, shape = donorID, color=Tube))+
scale_shape_manual(values = shape_panel1) +
geom_point()+
coord_flip()+
theme_point +
labs(y = "% correct strand with RSeQC", x= "tube", title = NULL, col = NULL)+
scale_colour_manual(values=color_panel)+
theme(panel.grid.major.x = element_line(colour = "lightgrey", linetype = "dashed", size=0.2))+
scale_y_continuous(limits=c(NA,100))
#ggsave("./data_output/plots/strandedness.pdf", plot = ggplot2::last_plot(), dpi = 300)
# strandedness_T0 <- strandedness %>% filter(TimeLapse == "T0")
#
# p1 <- ggplot(strandedness_T0,aes(x=reorder(donor_sampleID,dplyr::desc(donor_sampleID)),y=100*strand,col=RNAisolation, shape=GraphTube))+
# geom_point()+ coord_flip() +
# ylab("% correct strand") +
# scale_colour_manual(values = color_panel2) +
# scale_shape_manual(values = shape_panel2) +
# scale_y_continuous(limits=c(min(strandedness$strand*100)-5,NA)) +
# labs(x="") +
# theme_point
#
# strandedness_T04 <- strandedness %>% filter(TimeLapse == "T04")
#
# p2 <- ggplot(strandedness_T04,aes(x=reorder(donor_sampleID,dplyr::desc(donor_sampleID)),y=100*strand,col=RNAisolation, shape=GraphTube))+
# geom_point()+ coord_flip() +
# ylab("% correct strand") +
# scale_colour_manual(values = color_panel2) +
# scale_shape_manual(values = shape_panel2) +
# scale_y_continuous(limits=c(min(strandedness$strand*100)-5,NA)) +
# labs(x="") +
# theme_point
#
# strandedness_T16 <- strandedness %>% filter(TimeLapse == "T16")
#
# p3 <- ggplot(strandedness_T16,aes(x=reorder(donor_sampleID,dplyr::desc(donor_sampleID)),y=100*strand,col=RNAisolation, shape=GraphTube))+
# geom_point()+ coord_flip() +
# ylab("% correct strand") +
# scale_colour_manual(values = color_panel2) +
# scale_shape_manual(values = shape_panel2) +
# scale_y_continuous(limits=c(min(strandedness$strand*100)-5,NA)) +
# labs(x="") +
# theme_point
#
# p_all <- ggarrange(p1 + labs(title="T0"),p2 + labs(title="T04"),p3 + labs(title="T16"), common.legend = TRUE, ncol=2, nrow = 2, legend = "bottom")
#
# ggsave(plot=p_all, filename = file.path(output_path, "plots", "strandedness.pdf"), height=10, width=10, dpi = 300, useDingbats=F)
#
# rm(strandedness, p1, p2, p3)
Our pipeline was written to count reads using Kallisto, based on Ensembl v91. In this step, we make the raw count dataframes, as well as the counts per million (cpm) and transcripts per kilobase million (tpm) dataframes. The ERCC/Sequin spikes are filtered out, we group transcripts per gene and then sum the counts or tpm.
ensembl <- useEnsembl(biomart="ensembl",dataset="hsapiens_gene_ensembl")
genes_ens <- getBM(attributes=c('ensembl_gene_id','ensembl_transcript_id'),mart=ensembl)
genes_ens<-rbind(genes_ens,c("rDNA45S","gi|555853|gb|U13369.1|HSU13369"))
if (!file.exists("./kallisto_counts.tsv") & !file.exists("./kallisto_tpm.tsv")){
kallisto<-as.data.frame(genes_ens)
kallisto_tpm<-as.data.frame(genes_ens)
files<-list.files(data_path,recursive=TRUE)
files<-files[grep("abundance.tsv", files)]
for(i in 1:length(files)){
tmp<-data.table::fread(paste0(data_path,files[i]), header=T, sep="\t", data.table=FALSE)
name_sample<-gsub(".*/","",sub("-[0-9]*/dedup_clumpify-subs_atstart/[0-9]*_klout","",sub("/abundance.tsv","",files[i])))
tmp1<-tmp[,c("target_id","est_counts")]
tmp2<-tmp[,c("target_id","tpm")]
colnames(tmp1)<-c("target_id",name_sample)
colnames(tmp2)<-c("target_id",name_sample)
kallisto<-full_join(kallisto,tmp1,by=c("ensembl_transcript_id"="target_id"))
kallisto_tpm<-full_join(kallisto_tpm,tmp2,by=c("ensembl_transcript_id"="target_id"))
}
write_tsv(kallisto, "./kallisto_counts.tsv")
write_tsv(kallisto_tpm, "./kallisto_tpm.tsv")
} else {
kallisto <- read_tsv("./kallisto_counts.tsv")
kallisto_tpm <- read_tsv("./kallisto_tpm.tsv")
}
kallisto <- as_tibble(kallisto)
kallisto_tpm <- as_tibble(kallisto_tpm)
kallisto <- kallisto[rowSums(kallisto %>% select_if(is.numeric),na.rm=TRUE)>0,]
kallisto_tpm <- kallisto_tpm[rowSums(kallisto_tpm %>% select_if(is.numeric),na.rm=TRUE)>0,]
kallisto$ensembl_gene_id <- ifelse(grepl("ERCC", kallisto$ensembl_transcript_id), "ERCC",
ifelse(grepl("R1|R2", kallisto$ensembl_transcript_id), "Sequin",
kallisto$ensembl_gene_id))
kallisto_tpm$ensembl_gene_id <- ifelse(grepl("ERCC", kallisto_tpm$ensembl_transcript_id), "ERCC",
ifelse(grepl("R1|R2", kallisto_tpm$ensembl_transcript_id), "Sequin",
kallisto_tpm$ensembl_gene_id))
kallisto_gene_level <- kallisto %>% filter(!str_detect(ensembl_transcript_id,"Spike")) %>% group_by(ensembl_gene_id) %>% summarise_if(is.numeric, sum, na.rm=TRUE)
#add samples from phase 1
# kit_kallisto_gene_level <- read_tsv("./exRNAQC004_kallisto_gene_level.txt")
# tube_kallisto_gene_level <- read_tsv("./exRNAQC005_kallisto_gene_level.txt")
# tube_kit_kallisto_gene_level <- full_join(kit_kallisto_gene_level,tube_kallisto_gene_level)
# colnames(tube_kit_kallisto_gene_level) <- gsub("L1","",colnames(tube_kit_kallisto_gene_level))
kallisto_genes <- kallisto %>% filter(!str_detect(ensembl_transcript_id,"ERCC|R1|R2|Spike")) %>% group_by(ensembl_gene_id) %>% summarise_if(is.numeric, sum, na.rm=TRUE) #filter out spikes and group counts by gene_id (instead of transcript_id)
kallisto_genes_tpm <- kallisto_tpm %>% filter(!str_detect(ensembl_transcript_id,"ERCC|R1|R2|Spike")) %>% group_by(ensembl_gene_id) %>% summarise_if(is.numeric, sum, na.rm=TRUE)
kallisto_cpm <- as.data.frame(apply(kallisto %>% select_if(is.numeric),2,function(x) {10^6*x/sum(x,na.rm=TRUE)}))
kallisto_cpm$ensembl_transcript_id <- kallisto$ensembl_transcript_id
kallisto_cpm$ensembl_gene_id <- kallisto$ensembl_gene_id
kallisto_genes_cpm<-kallisto_cpm %>% filter(!str_detect(ensembl_transcript_id,"ERCC|R1|R2|Spike")) %>% group_by(ensembl_gene_id) %>% summarise_if(is.numeric, sum, na.rm=TRUE)
In our samples, we add two types of spikes:
Original spike concentrations in mix are taken from providers’ annotation files (Garvan Institute of Medical Research for Sequins and ThermoFisher Scientific for ERCCs)
We can look at the absolute number of spike counts. However, this is not that informative, it is better to look at ratio of endogenous RNA reads to spike reads.
kallisto_spikes <- kallisto %>% filter(str_detect(ensembl_transcript_id,"ERCC|R1|R2")) %>% group_by(ensembl_gene_id) %>% summarise_if(is.numeric, sum, na.rm=TRUE) #filter out spikes and group counts by gene_id (instead of transcript_id)
kallisto_spikes_tpm <- kallisto_tpm %>% filter(str_detect(ensembl_transcript_id,"ERCC|R1|R2")) %>% group_by(ensembl_gene_id) %>% summarise_if(is.numeric, sum, na.rm=TRUE)
spike_counts <- gather(kallisto_spikes, key="RNAID", value="est_counts", -ensembl_gene_id) %>% left_join(strandedness, by="RNAID") %>% filter(str_detect(ensembl_gene_id,"Sequin"))
total_count <- kallisto %>% summarise_if(is.numeric, sum, na.rm=TRUE) %>% gather(key="RNAID", value="total_count") %>% left_join(spike_counts, by="RNAID")
spikes_tpm <- gather(kallisto_spikes_tpm, key="RNAID", value="est_counts", -ensembl_gene_id) %>% left_join(strandedness, by="RNAID") %>% filter(str_detect(ensembl_gene_id,"Sequin"))
total_tpm <- kallisto_tpm %>% summarise_if(is.numeric, sum, na.rm=TRUE) %>% gather(key="RNAID", value="total_count") %>% left_join(spikes_tpm, by="RNAID")
ggplot(gather(kallisto_spikes, key="RNAID", value="est_counts", -ensembl_gene_id) %>% left_join(strandedness, by="RNAID"),
aes(x=tube_kit_timeID, y=(est_counts/min(est_counts)), colour=ensembl_gene_id, shape = donorID)) +
scale_shape_manual(values = shape_panel1) +
geom_point()+
scale_y_log10(labels=full_nr)+
coord_flip() +
theme_point +
geom_hline(yintercept=1, color="grey88",size=0.3) +
theme(axis.ticks.y = element_blank(),axis.line.y = element_blank(),panel.grid.major.y=element_line(linetype = "dashed",color="lightgray",size=0.2),
legend.title = element_blank()) +
labs(title="ERCC and Sequin spikes", y="relative sum of spike counts (rescaled to min value)",x="")+
scale_color_manual(values=color_panel[5:6])
#ggsave("./data_output/plots/ERCCandSequin-spikes.pdf", plot = ggplot2::last_plot(), dpi = 300)
ERCC spikes are added to RNA eluate
spikes_ERCC<-kallisto_tpm %>% filter(str_detect(ensembl_transcript_id,"ERCC")) %>% dplyr::select(-ensembl_gene_id) %>% dplyr::rename(spike_id=ensembl_transcript_id)
spikes_ERCC<-spikes_ERCC[rowSums(spikes_ERCC %>% select_if(is.numeric))>0,]
spikes_ERCC_melt<-melt(spikes_ERCC)
spikes_ERCC_melt<-left_join(spikes_ERCC_melt,spike_conc_ERCC,by="spike_id") %>% mutate(RNAID=variable)
spikes_ERCC_melt<-left_join(spikes_ERCC_melt,strandedness, by="RNAID")
spikes_ERCC_melt$spike_id<-factor(spikes_ERCC_melt$spike_id,levels=spikes_ERCC_melt$spike_id[rev(order(rowMeans(spikes_ERCC %>% select_if(is.numeric))))]) #order decreasing acc to row mean
TPM per spike (different plots) and per sample (x-axis)
These plots are based on TPM values as this corrects for length of the spike. The ratio Sequin to endogenous RNA is the same in every tube in start of experiment + all sequencing data subsampled to 14 M. Thus, we expect to have an equal count of Sequins in every tube if all tubes perform equally. The differences that we see here are not related to differences in concentrations at the start
# total amount of sequin reads
ggplot(gather(kallisto_spikes, key="RNAID", value="est_counts", -ensembl_gene_id) %>% left_join(strandedness, by="RNAID") %>% filter(str_detect(ensembl_gene_id,"Sequin")),
aes(y=tube_kit_timeID, x=(est_counts), fill=Tube, colour=Tube, shape = RNAisolation)) +
scale_shape_manual(values = shape_panel3) +
scale_fill_manual(values=color_panel) +
scale_colour_manual(values=color_panel) +
geom_point(size=2)+
theme_point +
labs(title="Total amount of sequin reads", x="sequin read count",y="")
# percentage of reads going to sequins
ggplot(total_count, aes(y=tube_kit_timeID, x=(est_counts/total_count*100), fill=Tube, colour=Tube, shape = RNAisolation)) +
scale_shape_manual(values = shape_panel3) +
scale_fill_manual(values=color_panel) +
scale_colour_manual(values=color_panel) +
geom_point(size=2)+
theme_point +
labs(title="Sequin spike percentage", x="sequin reads/total reads (in % reads)",y="")
#ggsave("./data_output/plots/sequinVStotal_perc_reads.png", plot = ggplot2::last_plot(), dpi = 300)
# percentage of reads going to sequins (tpm)
ggplot(total_tpm, aes(y=tube_kit_timeID, x=(est_counts/10000), fill=Tube, colour=Tube, shape = RNAisolation)) +
scale_shape_manual(values = shape_panel3) +
scale_fill_manual(values=color_panel) +
scale_colour_manual(values=color_panel) +
geom_point(size=2)+
theme_point +
labs(title="Sequin spike percentage (tpm)", x="sequin reads/total reads (in % tpm)",y="")
#ggsave("./data_output/plots/sequinVStotal_perc_tpm.png", plot = ggplot2::last_plot(), dpi = 300)
spikes_Seq<-kallisto_tpm %>% filter(str_detect(ensembl_transcript_id,"R1|R2")) %>% dplyr::select(-ensembl_gene_id) %>% dplyr::rename(spike_id=ensembl_transcript_id)
spikes_Seq<-spikes_Seq[rowSums(spikes_Seq %>% select_if(is.numeric))>0,]
spikes_Seq$gene_spike_id<-sub("_[0-9]$","",spikes_Seq$spike_id)
spikes_Seq_gene<- spikes_Seq %>% group_by(gene_spike_id) %>% summarise_if(is.numeric,sum)
spikes_Seq_melt <- melt(spikes_Seq_gene)
spikes_Seq_melt<-left_join(spikes_Seq_melt,spike_conc_gene_Seq,by=c("gene_spike_id"="spike_id"))
spikes_Seq_melt<- spikes_Seq_melt %>% dplyr::rename(RNAID=variable)
spikes_Seq_melt<-left_join(spikes_Seq_melt,strandedness)
spikes_Seq_melt$gene_spike_id<-factor(spikes_Seq_melt$gene_spike_id,levels=spikes_Seq_melt$gene_spike_id[rev(order(rowMeans(spikes_Seq_gene[,-1])))])
ggplot(spikes_Seq_melt, aes(y = log(value*EluateV+1,10), x=RNAID, colour = mix1, fill = mix1)) +
geom_bar(stat="identity") +
facet_wrap(~gene_spike_id) +
theme_bar+
labs(title="Sequin log10 TPM in total eluate (per spike)", y="log10(TPM*EluateV+1)")+
theme(strip.text.x = element_text(size=6),axis.text.x = element_blank())+
scale_fill_gradient(high="red",low="orange")+
scale_color_gradient(high="red",low="orange")
#ggsave("./data_output/plots/Sequinlog10TPMperspike.pdf", plot = ggplot2::last_plot(), dpi = 300)
highest_spikes <- levels(spikes_Seq_melt$gene_spike_id)[1:25] #get the names of the 16 highest spikes (levels are sorted according to rowmean from high to low)
spikes_Seq_melt_high <- filter(spikes_Seq_melt, gene_spike_id %in% highest_spikes)
fit_augment_spikes_Seq<-spikes_Seq_melt_high %>%
group_by(RNAID) %>%
do(broom::augment(lm(log(value+1,10)~log(mix1+1,10),data = .))) %>%
dplyr::rename(logvalue=`log(value + 1, 10)`,logmix1=`log(mix1 + 1, 10)`)
fit_glance_spikes_Seq<-spikes_Seq_melt_high %>%
group_by(RNAID) %>%
do(broom::glance(lm(log(value+1,10)~log(mix1+1,10),data = .)))
spikes_Seq_melt_high <- left_join(spikes_Seq_melt_high, as.tibble(cbind(RNAID= fit_glance_spikes_Seq$RNAID, R2= fit_glance_spikes_Seq$r.squared)), by="RNAID")
spikes_Seq_melt_high$lab = paste0(spikes_Seq_melt_high$tube_kit_timeID, "\nR^2 = ", round(as.numeric(spikes_Seq_melt_high$R2),3)) #get Rsquared value for each plot
ggplot(spikes_Seq_melt_high, aes(y=log(value+1,10), x=log(mix1+1,10))) +
geom_jitter(cex=0.2) +
theme_point+
labs(title="Sequin log10 tpm per sample (25 highest spikes)", y="log10(counts+1)", x="log10(mix_conc+1)")+
stat_smooth(method=lm,color="darkgray",se=TRUE, fill= "gray", level = .95,size=0.5)+
scale_x_continuous(limits=c(2,NA)) +
facet_wrap(~as.character(lab)) +
theme(strip.text.x = element_text(size=9))
#ggsave("./data_output/plots/Sequinlog10TPMpersample_16-highest-spikes.pdf", plot = ggplot2::last_plot(), dpi = 1000)
The volume of Sequin spikes that is added is proportional to plasma input volume (1 microL Sequin/ 100 microL plasma)
There is a clear difference in the endogenous/Sequin ratio between both kits. This would mean there is a more efficient extraction of Sequins by the miRNeasySP kit compared to the QIAamp kit.
gene_level_ratios <- rbind(kallisto_genes %>% summarise_if(is.numeric, sum, na.rm=TRUE),
kallisto_spikes %>% select_if(is.numeric)) %>%
cbind(type=c("endogenous","ERCC","Sequin")) %>% #add the type in a new column
gather(., key="RNAID",value="counts",-type) %>% #long format
spread(., key = "type", value="counts") %>%
mutate("ERCCvsEndo"=ERCC/endogenous,
"SeqvsEndo"=Sequin/endogenous,
"EndovsSeq"=endogenous/Sequin,
"EndovsERCC"= endogenous/ERCC) %>%
left_join(., strandedness, by="RNAID") #add annotation
#write.csv(gene_level_ratios, file='./data_output/gene_level_ratios.csv')
# Plot original ratio endo/seq
ggplot(gene_level_ratios, aes(y=tube_kit_timeID, x=EndovsSeq, fill=Tube, colour=Tube, shape = RNAisolation)) +
scale_shape_manual(values = shape_panel3) +
scale_fill_manual(values=color_panel) +
scale_colour_manual(values=color_panel) +
geom_point(size=2) +
theme_bar +
labs(y="", x="endogenous/sequin", title="relative RNA concentration in plasma", col = NULL, fill = NULL)
#ggsave("./data_output/plots/EndovsSeq.pdf", plot = ggplot2::last_plot(), dpi = 300)
#calculate statistics for Sequin/endogenous (sd, sem, 95% ci)
test <- log_rescaling_CI(gene_level_ratios, measurevar="EndovsSeq", groupvar=c("tube_kitID"))
# Plot ERCC/endo in log10 scale
# spikes2 <- ggplot(test, aes(x=tube_kit_timeID, y=10^measurevar_log_resc)) +
# #geom_bar(position=position_dodge(), stat="identity")+
# geom_errorbar(aes(ymin=ci_lower_oriscale, ymax=ci_upper_oriscale), colour="grey", width=.1) +
# geom_point(aes(colour=Tube), size=1.2) +
# geom_point(data=test, aes(x=tube_kitID, y=mean_oriscale), shape=4, colour="grey") +
# geom_hline(yintercept = 1, linetype="dashed",colour="grey88") +
# labs(x="", y="relative Endo/Sequin", title="endogenous vs Sequin counts (rescaled)", col = "") +
# scale_colour_manual(values=color_panel) +
# scale_y_log10() +
# theme_bar
p1 <- ggplot(test,aes(x=TimeLapse,y=10^measurevar_log_resc, col=donorID, group = donorID))+
geom_point()+geom_line()+
theme_point+
facet_wrap(~tube_kitID, ncol = 2, scales="free_x") +
theme(panel.grid.major.x=element_line(linetype = "dashed",color="gray88"), axis.text.x=element_text(size=8)) +
scale_color_manual(values=color_panel) +
theme(axis.text.x=element_text(angle=90, vjust=0.5, hjust=1)) +
scale_y_log10() +
labs(x="", y="relative Endogenous/Sequin", title="endogenous counts vs Sequin (rescaled to lowest)", col = "")
ggplotly(p1)
#ggsave("./data_output/plots/SeqVsEndo_rescaled_linePlot.pdf", plot = p1, dpi = 300)
rm(p1)
ERCC spikes were each time added in the same amount (2 microL) to 12 microL of eluate after RNA isolation. The ratio of endogenous RNA to ERCC reflects the relative concentration of endogenous RNA in the eluate. The higher the endogenous RNA concentration in used fraction of eluate, the less ERCCs, the higher the ratio endo/ERCC.
QIAamp has a higher conc of endogenous RNA compared to miRNAeasy, but this is to be expected as the plasma input volume is also 10x higher.
#kallisto_genes <- kallisto %>% filter(!str_detect(ensembl_transcript_id,"ERCC|R1|R2")) %>% group_by(ensembl_gene_id) %>% summarise_if(is.numeric, sum, na.rm=TRUE)
ggplot(gene_level_ratios, aes(x=EndovsERCC, y=tube_kit_timeID, fill=Tube, colour=Tube, shape = RNAisolation)) +
#geom_bar(stat="identity") +
geom_point(size=2) +
scale_shape_manual(values = shape_panel3) +
scale_fill_manual(values=color_panel) +
scale_colour_manual(values=color_panel) +
theme_bar +
labs(y="", x="Endogenous/ERCC", title="relative RNA concentration", subtitle="endogenous RNA vs ERCC", col = NULL, fill = NULL)
#ggsave("./data_output/plots/EndovsERCC.pdf", plot = ggplot2::last_plot(), dpi = 300)
#calculate statistics for endogenous/ERCC (sd, sem, 95% ci)
test <- log_rescaling_CI(gene_level_ratios, measurevar="EndovsERCC", groupvar=c("tube_kitID"))
# Plot ERCC/endo in log10 scale
spikes_conc <- ggplot(test, aes(x=tube_kit_timeID, y=10^measurevar_log_resc)) +
#geom_bar(position=position_dodge(), stat="identity")+
geom_errorbar(aes(ymin=ci_lower_oriscale, ymax=ci_upper_oriscale), colour="grey", width=.1) +
geom_point(aes(colour=Tube), size=1.2) +
geom_point(data=test, aes(x=GraphTube, y=mean_oriscale), shape=4, colour="grey") +
geom_hline(yintercept = 1, linetype="dashed",colour="grey88") +
labs(x="", y="Relative endogenous RNA concentration", title="RNA concentration", subtitle="Endogenous RNA vs ERCC (rescaled)") +
scale_colour_manual(values=color_panel) +
scale_y_log10() +
theme_bar
tmp_summary <- gene_level_ratios %>% group_by(tube_kitID) %>% dplyr::summarize(mean_EndovsERCC= mean(EndovsERCC), sd_EndovsERCC=sd(EndovsERCC)) %>% mutate(CV_EndovsERCC = sd_EndovsERCC/mean_EndovsERCC*100) %>% left_join(unique(dplyr::select(gene_level_ratios, c("Tube","GraphKit","tube_kitID"))), by="tube_kitID")
ggplot(gene_level_ratios, aes(x = EndovsERCC, y = EndovsSeq, col = tube_kitID)) +
geom_point()+geom_smooth(method = "lm", se = TRUE, aes(col = FALSE))+
theme_point + geom_abline(intercept=0, slope=1, color = "gray", linetype = "dashed")+
theme(panel.grid.major.x=element_line(linetype = "dashed",color="gray88"), axis.text.x=element_text(size=8)) +
scale_color_manual(values=color_panel) +
scale_y_continuous(trans = "log2") + scale_x_continuous(trans = "log2") +
theme(axis.text.x=element_text(angle=90, vjust=0.5, hjust=1)) + labs(x = "endogenous/ERCC", y = "endogenous/Sequin", title = "correlation of rel. RNA concentration in plasma (Sequin) and eluate (ERCC)", col = NULL) +
stat_cor(method = "spearman", aes(col = FALSE, label=paste(..r.label.., cut(..p..,breaks = c(-Inf, 0.0001, 0.001, 0.01, 0.05, Inf), labels = c("'****'", "'***'", "'**'", "'*'", "'ns'")), sep = "~")))
#ggsave("./data_output/plots/SeqVsEndo_Spearman.pdf", plot = ggplot2::last_plot(), dpi = 300)
p1 <- ggplot(test,aes(x=TimeLapse,y=10^measurevar_log_resc, col=donorID, group = donorID))+
geom_point()+geom_line()+
theme_point+
facet_wrap(~tube_kitID, ncol = 2, scales="free_x") +
theme(panel.grid.major.x=element_line(linetype = "dashed",color="gray88"), axis.text.x=element_text(size=8)) +
scale_y_continuous(limits = c(0, NA))+
scale_color_manual(values=color_panel) +
theme(axis.text.x=element_text(angle=90, vjust=0.5, hjust=1)) +
scale_y_log10() +
labs(x="", y="relative endogenous RNA concentration", title="evolution of relative RNA concentration", subtitle="Endogenous RNA vs ERCC (rescaled to lowest)", col = "")
ggplotly(p1)
#ggsave("./data_output/plots/ERCCVsEndo_linePlot.pdf", plot = ggplot2::last_plot(), dpi = 300)
knitr::include_graphics('./exRNAQC004_plots/rel_RNA-conc.png')
knitr::include_graphics('./exRNAQC005_plots/rel_RNA-conc.png')
FC <- generateFC(input_df = gene_level_ratios, variable = "EndovsSeq", dfName = "RNA concentration based on Sequin")
names(FC)[names(FC) == 'foldchange'] <- 'EndoVsSeq_FC'
FC_all <- merge(FC %>% dplyr::select(-inputType), FC_all, on = c("tube_kitID","donorID", "RNAisolation","Tube","ReplID", "Replicates","TimePoint"))
#ggsave("./data_output/plots/RNAconc_FC.pdf", plot = ggplot2::last_plot(), dpi = 300)
RNAconc_FC_plot <- ggplot2::last_plot()
Similar to phase 1, we see that the miRNeasy kit has a higher extraction efficiency than the QIAamp kit. However, the miRNeasy kit also gives more variable results.
# Correct relative concentration for eluate V (and for input V for next metric)
gene_level_ratios <- gene_level_ratios %>%
mutate("EndovsERCC_InputCorr"= (endogenous/ERCC)*EluateV/PlasmaInputV,
"SeqvsERCC_InputCorr" = (Sequin/ERCC)*EluateV/PlasmaInputV)
ggplot(gene_level_ratios, aes(x=tube_kit_timeID, y=EndovsERCC_InputCorr, fill=Tube, colour=Tube, shape = RNAisolation)) +
#geom_bar(stat="identity") +
geom_point(size=1.2) +
scale_shape_manual(values = shape_panel3) +
scale_fill_manual(values=color_panel) +
scale_colour_manual(values=color_panel) +
coord_flip() +
theme_point +
labs(x="", y="Relative efficiency (endogenous/ERCC)*EluateV/PlasmaInputV)", title="Efficiency of kits (endogenous)", subtitle="based on endogenous RNA & ERCC spikes")
Fold changes of extraction efficiency
FC <- generateFC(input_df = gene_level_ratios, variable = "EndovsERCC_InputCorr", dfName = "extraction efficiency")
names(FC)[names(FC) == 'foldchange'] <- 'efficiency_FC'
FC_all <- merge(FC %>% dplyr::select(-inputType), FC_all, on = c("tube_kitID","donorID", "RNAisolation","Tube","ReplID", "Replicates","TimePoint"))
#ggsave("./data_output/plots/GeneCount_filtered_FC.pdf", plot = ggplot2::last_plot(), dpi = 300)
geneCount_FC_plot <- ggplot2::last_plot()
knitr::include_graphics('./exRNAQC004_plots/efficiency.png')
Not available
# relative efficiency based on endogenous RNA
eff_endo <- log_rescaling_CI(gene_level_ratios, measurevar="EndovsERCC_InputCorr", groupvar=c("tube_kitID")) %>%
mutate(measurevar_resc_oriscale = 10^measurevar_log_resc)
ggplot(eff_endo, aes(x=tube_kitID, y=measurevar_resc_oriscale)) +
#geom_bar(position=position_dodge(), stat="identity")+
geom_errorbar(aes(ymin=ci_lower_oriscale, ymax=ci_upper_oriscale), colour="grey", width=.1) +
geom_point(aes(colour=Tube, shape = RNAisolation), size=2) +
geom_point(data=eff_endo, aes(x=tube_kitID, y=mean_oriscale), shape=4, colour="grey") +
geom_hline(yintercept = 1, linetype="dashed",colour="grey88") +
labs(x="", y="relative RNA isolation efficiency") +
scale_shape_manual(values = shape_panel3) +
scale_fill_manual(values=color_panel) +
scale_colour_manual(values=color_panel) +
scale_y_log10() +
coord_flip() + theme(legend.position="none") +
theme_point
#ggsave("./data_output/plots/efficiency_endo.pdf", plot=ggplot2::last_plot(), height=4, width=6, dpi = 300, useDingbats=F)
Relative efficiency of kits. Efficiency: plasma input volume corrected RNA yield. Values were first log transformed and rescaled to the average of the lowest yield, then transformed back to linear scale. Mean per kit (cross) and 95% confidence intervals shown.
# relative efficiency based on Sequin spikes
eff_seq <- log_rescaling_CI(gene_level_ratios, measurevar="SeqvsERCC_InputCorr", groupvar=c("tube_kitID")) %>%
mutate(measurevar_resc_oriscale = 10^measurevar_log_resc)
ggplot(eff_seq, aes(x=tube_kitID, y=measurevar_resc_oriscale)) +
#geom_bar(position=position_dodge(), stat="identity")+
geom_errorbar(aes(ymin=ci_lower_oriscale, ymax=ci_upper_oriscale), colour="grey", width=.1) +
geom_point(aes(colour=Tube, shape = RNAisolation), size=2) +
geom_point(data=eff_seq, aes(x=tube_kitID, y=mean_oriscale), shape=4, colour="grey") +
geom_hline(yintercept = 1, linetype="dashed",colour="grey88") +
labs(x="", y="Relative efficiency (Sequin/ERCC)*EluateV/PlasmaInputV)", title="Efficiency of kits (Sequins), rescaled", subtitle="based on Sequin & ERCC spikes") +
scale_shape_manual(values = shape_panel3) +
scale_fill_manual(values=color_panel) +
scale_colour_manual(values=color_panel) +
scale_y_log10() +
coord_flip() + theme(legend.position="none") +
theme_point
#ggsave("./data_output/plots/efficiency_endo.pdf", plot=ggplot2::last_plot(), height=4, width=6, dpi = 300, useDingbats=F)
In order to remove genes with very few counts (and thus to improve repeatability between two replicates) we set a count cut-off at 6 counts. We based this cut-off on the results from exRNAQC004 (the RNA extraction kit study), where we had technical replicates for the combination of miRNeasy serum/plasma and EDTA tubes. Only the protein-coding genes that reach the cut-off are retained.
cutoff_kit <- data.frame(tube_kitID = gene_level_ratios$tube_kitID, median_th = 6)
ensembl <- useEnsembl(biomart="ensembl",dataset="hsapiens_gene_ensembl", version = 91)
genes_ens <- getBM(attributes=c('ensembl_gene_id','gene_biotype'),mart=ensembl)
#genes_ens <- data.table::fread(paste0(data_path,"gene_biotypes_ensemblv91.txt"), header=T, data.table = F)
kallisto_genes_long <- kallisto_genes %>% gather(., key="RNAID", value="est_counts", -ensembl_gene_id) %>% #long format
left_join(., dplyr::select(samples,c(RNAID,GraphTube, Tube, donorID, TimeLapse, RNAisolation, GraphKit, tube_kitID, tube_kit_timeID)), by="RNAID") %>% #add kit column
left_join(., genes_ens, by="ensembl_gene_id") #add gene biotype
#keep only protein coding genes with more than 0 counts
kallisto_genes_long <- kallisto_genes_long %>% filter(est_counts > 0) %>%
filter(gene_biotype=="protein_coding")
number_genes_detected <- kallisto_genes_long %>% group_by(RNAID) %>% dplyr::summarize(genes_above0=n())
number_genes_detected <- full_join(number_genes_detected,
kallisto_genes_long %>% group_by(RNAID) %>%
dplyr::summarize(total_est_counts_above0=sum(est_counts)),
by="RNAID")
kallisto_genes_cutoff <- kallisto_genes_long %>% filter(est_counts > 6) %>%
filter(gene_biotype=="protein_coding")
number_genes_detected <- full_join(number_genes_detected,
kallisto_genes_cutoff %>% group_by(RNAID) %>%
dplyr::summarize(genes_aboveTh=n()),
by="RNAID")
number_genes_detected <- full_join(number_genes_detected,
kallisto_genes_cutoff %>% group_by(RNAID) %>%
dplyr::summarize(total_est_counts_aboveTh=sum(est_counts)),
by="RNAID")
number_genes_detected <- left_join(number_genes_detected,
dplyr::select(samples, c(RNAID,GraphTube, Tube, donorID, TimeLapse, RNAisolation, GraphKit, tube_kitID, tube_kit_timeID, EluateV)),
by="RNAID")
#write.csv(number_genes_detected, file='./data_output/number_genes_detected.csv')
#convert to the original format
#kallisto_genes_cutoff <- dplyr::select(kallisto_genes_cutoff, ensembl_gene_id, UniqueID, est_counts, Tube, BiologicalReplicate, TimeLapse) %>%
# spread(., key=UniqueID, value=est_counts)
#write.csv(kallisto_genes_cutoff, file="./docs/kallisto_genes_cutoff.csv",row.names=FALSE, na="")
#grid.arrange(p1 + scale_y_continuous(limits=c(0,20000)) + theme(legend.position="none"),p3 + scale_y_continuous(limits=c(0,20000)) + theme(legend.position="none"),nrow=1)
FC <- generateFC(input_df = number_genes_detected, variable = "genes_aboveTh", dfName = "number of genes (filtered)")
names(FC)[names(FC) == 'foldchange'] <- 'numbergenes_FC'
FC_all <- merge(FC %>% dplyr::select(-inputType), FC_all, on = c("tube_kitID","donorID", "RNAisolation","Tube","ReplID", "Replicates","TimePoint"))
#ggsave("./data_output/plots/GeneCount_filtered_FC.pdf", plot = ggplot2::last_plot(), dpi = 300)
geneCount_FC_plot <- ggplot2::last_plot()
The ALC values are summarized based on the average of the ALC values (after removing NAs). The dot plot at the end gives an overview of all these values.
Score: lower ALC = better
#print(all_plot + labs(title="Pairwise ALCs"))
ALC_melt <- left_join(ALC, samples[,c("tube_kitID")], by="tube_kitID")
ALC$BiologicalReplicate <- gsub("[-]TUBE[0-9]*","", ALC$BiologicalReplicate)
ALC$Replicates <- gsub("^Rep04_0$","T0_04", ALC$Replicates)
ALC$Replicates <- gsub("^Rep0_04$","T0_04", ALC$Replicates)
ALC$Replicates <- gsub("^Rep16_0$","T0_16", ALC$Replicates)
ALC$Replicates <- gsub("^Rep0_16$","T0_16", ALC$Replicates)
ALC$Replicates <- gsub("^Rep16_04$","T04_16", ALC$Replicates)
ALC$Replicates <- gsub("^Rep04_16$","T04_16", ALC$Replicates)
#ALC_melt <- ALC_melt %>% filter(Replicates %in% c("Rep0_04", "Rep04_16", "Rep0_16", "Rep0_24", "Rep24_72", "Rep0_72"))
ALC$TimePoint <- ifelse(ALC$Replicates %in% c("T0_04", "T04_0"), "1st timepoint", ifelse(ALC$Replicates%in% c("T0_16", "T16_0"), "2nd timepoint", NA))
write.csv(ALC, file='./data_output/ALC.csv')
ggplot(ALC %>% filter(!is.na(TimePoint)), aes(x=reorder(tube_kitID,ALC_calc, FUN = function(x) -mean(x, na.rm = TRUE)), y = 2^ALC_calc)) +
geom_boxplot() +
geom_beeswarm(groupOnX=TRUE, size = 2.5, aes(col = TimePoint)) +
geom_hline(yintercept = 1, linetype = "dashed", color = "gray") +
labs(subtitle = "ordered by mean", title = "pairwise ALCs (both > 0, at least 1 > cut-off)", y = "linear ALC", x = "tube", col = "comparison") +
scale_fill_manual(values=color_panel) + theme_point + theme(axis.text.x = element_text(angle = 90)) +
stat_summary(
geom = "point",
fun.y = "mean",
col = "black",
size = 3,
shape = 24,
fill = "white"
)
#ggsave("./data_output/plots/ALC_dotPlot.pdf", plot = ggplot2::last_plot(), dpi = 300)
ALC_FC_plot <- ggplot2::last_plot()
ggplot(ALC,aes(x=Replicates,y=ALC_calc, col=BiologicalReplicate, group = BiologicalReplicate))+
geom_point()+geom_line()+
theme_point+
facet_wrap(~tube_kitID, ncol = 2, scales="free_x") +
theme(panel.grid.major.x=element_line(linetype = "dashed",color="gray88"), axis.text.x=element_text(size=8)) +
scale_y_continuous(limits = c(0, NA))+
scale_color_manual(values=color_panel) +
theme(axis.text.x=element_text(angle=90, vjust=0.5, hjust=1)) +
labs(y="ALC",title="pairwise ALCs (both > 0, at least 1 > cut-off)", subtitle="lower ALC = better", col = "", x = NULL)
#ggsave("./data_output/plots/ALC_linePlot.pdf", plot = ggplot2::last_plot(), dpi = 300)
ALC_mean <- ALC %>% dplyr::group_by(tube_kitID) %>% dplyr::summarise(ALC = mean(ALC_calc))
for (replicates in unique(ALC_input_all$ReplID)) {
for (techrep in c("D1-D1", "D2-D2", "D3-D3", "D4-D4", "D5-D5")){
# replicates = "DNA Streck-Rep24_0"
# techrep = "TUBE1-TUBE1" # plot the ALC (= the colored part of the plot)
tmp <- ALC_input_all %>% dplyr::filter(ReplID==replicates) %>%
dplyr::filter(BiologicalReplicate == techrep)
if (nrow(tmp) != 0){
print(ggplot( tmp %>%
mutate(log2_ratio_resc = log2_ratio/max_ALC))+
geom_line(aes(x=log2_ratio_resc,y=perc_counter))+
#facet_wrap(~ReplID) +
geom_ribbon(aes(x=log2_ratio_resc,ymin=perc_counter,ymax=1), fill="lightblue")+
geom_hline(aes(yintercept = 1))+
theme_classic()+
scale_x_continuous(limits=c(0,1)) +
scale_y_continuous(expand = c(0, 0)) +
theme(legend.position = "none") +
labs(title=paste0(replicates, " - ", techrep),
subtitle=paste("ALC:", round(ALC %>% ungroup() %>% dplyr::filter(ReplID==replicates) %>%
dplyr::filter(BiologicalReplicate == techrep) %>% dplyr::select(ALC_calc),2)), #print ALC for this particular comparison!
y="rank percentile",x="rescaled log2 ratio"))
}
}
}
For every sample, we have plotted the percentage of counts and genes mapping to which biotypes. We expect that almost all genes that are detected are protein coding genes (as we apply an RNA exome capture-based method). Due to non-specific hybrid capture, also non-protein coding genes may be enriched.
genes_ens <- getBM(attributes=c('ensembl_gene_id','hgnc_symbol','gene_biotype','chromosome_name'),mart=ensembl)
genes_ens$gene_biotype[grep("protein coding",genes_ens$gene_biotype)]<-"protein coding"
genes_ens$gene_biotype[grep("pseudogene",genes_ens$gene_biotype)]<-"pseudogene"
genes_ens$gene_biotype[grep("TR",genes_ens$gene_biotype)]<-"other"
genes_ens$gene_biotype[grep("IG",genes_ens$gene_biotype)]<-"other"
genes_ens$gene_biotype[grep("TEC",genes_ens$gene_biotype)]<-"other"
genes_ens$gene_biotype[grep("^sc",genes_ens$gene_biotype)]<-"other"
genes_ens$gene_biotype[grep("ribozyme",genes_ens$gene_biotype)]<-"other"
genes_ens$gene_biotype[grep("miRNA",genes_ens$gene_biotype)]<-"other"
genes_ens$gene_biotype[grep("vault",genes_ens$gene_biotype)]<-"other"
genes_ens$gene_biotype[grep("antisense_RNA",genes_ens$gene_biotype)]<-"lncRNA"
genes_ens$gene_biotype[grep("lincRNA",genes_ens$gene_biotype)]<-"lncRNA"
genes_ens$gene_biotype[grep("lncRNA",genes_ens$gene_biotype)]<-"lncRNA"
genes_ens$gene_biotype[grep("overlapping",genes_ens$gene_biotype)]<-"lncRNA"
genes_ens$gene_biotype[grep("^sense",genes_ens$gene_biotype)]<-"lncRNA"
genes_ens$gene_biotype[grep("non_coding",genes_ens$gene_biotype)]<-"lncRNA"
genes_ens$gene_biotype[grep("processed_transcript",genes_ens$gene_biotype)]<-"lncRNA"
genes_ens$gene_biotype[grep("^s.*RNA$",genes_ens$gene_biotype)]<-"s(no)RNA"
genes_ens$gene_biotype[grep("MT",genes_ens$chromosome_name)]<-"MT_gene"
genes_ens$gene_biotype[grep("RNR",genes_ens$hgnc_symbol)]<-"MT_RNRgene"
kallisto_genes_tpm <- left_join(kallisto_genes_tpm,genes_ens)
kallisto_genes_tpm$gene_biotype[grep("45S",kallisto_genes_tpm$ensembl_gene_id)]<-"rRNA_45S"
kallisto_genes_tpm$gene_biotype[is.na(kallisto_genes_tpm$gene_biotype)]<-"unknown"
kallisto_genes_tpm$gene_biotype <- relevel(as.factor(kallisto_genes_tpm$gene_biotype) , 'protein_coding')
kallisto_genes<-left_join(kallisto_genes,genes_ens)
kallisto_genes$gene_biotype[grep("45S",kallisto_genes$ensembl_gene_id)]<-"rRNA_45S"
kallisto_genes$gene_biotype[is.na(kallisto_genes$gene_biotype)]<-"unknown"
kallisto_genes_cpm<-left_join(kallisto_genes_cpm,genes_ens)
kallisto_genes_cpm$gene_biotype[grep("45S",kallisto_genes_cpm$ensembl_gene_id)]<-"rRNA_45S"
kallisto_genes_cpm$gene_biotype[is.na(kallisto_genes_cpm$gene_biotype)]<-"unknown"
kallisto_genes_tpm_cumsum<-kallisto_genes_tpm %>% group_by(gene_biotype) %>% summarise_if(is.numeric,.funs=sum)
kallisto_genes_tpm_cumsum<-melt(kallisto_genes_tpm_cumsum)
kallisto_genes_tpm_cumsum<-left_join(kallisto_genes_tpm_cumsum %>% dplyr::rename(RNAID = variable),samples)
kallisto_genes_tpm_cumsum<-kallisto_genes_tpm_cumsum%>%mutate(tube_donor_timeID= paste0(GraphTube,"_",donorID,"_",TimeLapse))
ggplot(kallisto_genes_tpm_cumsum %>% filter(GraphKit == "MIR0.2"),aes(x=tube_donor_timeID,y=value,fill=gene_biotype))+
geom_bar(stat="identity",position = "fill")+
theme_bar+
facet_wrap(~Tube, nrow=1, scales="free_x") +
theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=1),axis.line.x = element_blank(),axis.ticks.x = element_blank())+
scale_y_continuous(expand=c(0,0))+
scale_fill_manual(values=c("#23496b","#5bb75b","#428bca","#e87810","#e35d6a","#ffbf00","#cc2028","#039748","pink","gray","darkgray")) +
labs(title="%TPM per biotype (miRNeasySPkit)", y = "fraction", x = "", col = NULL, fill = NULL)
#ggsave("./data_output/plots/Biotype_preservation_barPlot.pdf", plot = ggplot2::last_plot(), dpi = 300)
ggplot(kallisto_genes_tpm_cumsum %>% filter(GraphKit == "CCF2"),aes(x=tube_donor_timeID,y=value,fill=gene_biotype))+
geom_bar(stat="identity",position = "fill")+
theme_bar+
facet_wrap(~Tube, nrow=1, scales="free_x") +
theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=1),axis.line.x = element_blank(),axis.ticks.x = element_blank())+
scale_y_continuous(expand=c(0,0))+
scale_fill_manual(values=c("#23496b","#5bb75b","#428bca","#e87810","#e35d6a","#ffbf00","#cc2028","#039748","pink","gray","darkgray")) +
labs(title="%TPM per biotype (QIAamp)", y = "fraction", x = "", col = NULL, fill = NULL)
#ggsave("./data_output/plots/Biotype_nonPreservation_barPlot.pdf", plot = ggplot2::last_plot(), dpi = 300)
plot_percentage <- function(input_df, biotype){
tmp <- input_df %>% dplyr::group_by(tube_kitID) %>% dplyr::mutate(value_prct = value/sum(value))
# Plot ERCC/endo in log10 scale
prct <- ggplot(tmp %>% filter(gene_biotype == biotype), aes(x=GraphTube, y=value_prct*100)) +
#geom_bar(position=position_dodge(), stat="identity")+
geom_point(aes(colour=Tube), size=1.2) +
#geom_hline(yintercept = 1, linetype="dashed",colour="grey88") +
scale_colour_manual(values=color_panel) +
#scale_y_log10() +
theme_bar
ggplotly(prct)}
plot_percentage_evolution <- function(input_df, biotype){
tmp <- input_df %>% dplyr::group_by(tube_kitID) %>% dplyr::mutate(value_prct = value/sum(value))
p1 <- ggplot(tmp %>% filter(gene_biotype == biotype),aes(x=TimeLapse, y=value_prct*100, col=donorID, group = donorID))+
geom_point()+geom_line()+
theme_point+
facet_wrap(~tube_kitID, ncol = 2, scales="free_x") +
theme(panel.grid.major.x=element_line(linetype = "dashed",color="gray88"), axis.text.x=element_text(size=8)) +
scale_y_continuous(limits = c(0, NA))+
scale_color_manual(values=color_panel) +
theme(axis.text.x=element_text(angle=90, vjust=0.5, hjust=1)) +
labs(x="", col = "", y=paste0(biotype, " (% of total counts)"), title= paste0("evolution of ", biotype, " gene count fraction"))
print(p1)
return(tmp)
}
evoDf <- plot_percentage_evolution(kallisto_genes_tpm_cumsum, "protein_coding")
#ggsave("./data_output/plots/Biotype_evolution_proteincoding_linePlot.pdf", plot = ggplot2::last_plot(), dpi = 300)
FC <- generateFC(input_df = evoDf %>% filter(gene_biotype == "protein_coding"), variable = "value_prct", dfName = "protein coding gene count %")
names(FC)[names(FC) == 'foldchange'] <- 'biotype_FC'
FC_all <- merge(FC %>% dplyr::select(-inputType), FC_all, on = c("tube_kitID", "donorID", "ReplID", "Replicates"))
#ggsave("./data_output/plots/Biotype_evolution_proteincoding_FC.pdf", plot = ggplot2::last_plot(), dpi = 300)
biotype_FC_plot <- ggplot2::last_plot()
tmp_summary <- kallisto_genes_tpm_cumsum %>% dplyr::group_by(tube_kitID) %>% dplyr::mutate(value_prct = value/sum(value)) %>% ungroup() %>% dplyr::group_by(tube_kitID, gene_biotype) %>% dplyr::summarize(mean_counts_prct= mean(value_prct), sd_counts_prct=sd(value_prct)) %>% mutate(CV_counts = sd_counts_prct/mean_counts_prct*100) %>%
left_join(unique(dplyr::select(samples, c("Tube","RNAisolation","tube_kitID"))), by="tube_kitID")
FC_all_means <- FC_all %>% dplyr::group_by(tube_kitID,RNAisolation,Tube) %>% dplyr::summarise_all(funs(mean), na.rm = TRUE) %>% dplyr::select(-c("donorID", "ReplID", "Replicates", "TimePoint"))
ALC_mean$ALC <- 2^ALC_mean$ALC
FC_all_means <- merge(ALC_mean, FC_all_means)
FC_all_means <- FC_all_means %>% melt()
ggplot(FC_all_means, aes(x = reorder(variable, value, FUN = function(x) -median(x, na.rm = TRUE)), y = value, group = tube_kitID, color = Tube, shape = RNAisolation))+
geom_line() + geom_point() + theme_point +
labs(title = "summary of fold changes", y = "mean FC", x = "metric", col = "tube") +
scale_color_manual(values=color_panel) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_x_discrete(labels=c("ALC_FC" = "ALC","numbergenes_FC" = "geneCount", "EndoVsSeq_FC" = "RNAconc", "efficiency_FC" = "extraction efficiency", "duplicates_FC" = "duplicates"))
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] plyr_1.8.7 emmeans_1.7.4-1 multcomp_1.4-19 TH.data_1.1-1
## [5] MASS_7.3-57 survival_3.3-1 mvtnorm_1.1-3 nlme_3.1-157
## [9] rstatix_0.7.0 nnls_1.4 plotly_4.10.0 RCurl_1.98-1.6
## [13] ggbeeswarm_0.6.0 ggpubr_0.4.0 corrplot_0.92 gridExtra_2.3
## [17] ggrepel_0.9.1 scales_1.2.0 RColorBrewer_1.1-3 broom_0.8.0
## [21] reshape2_1.4.4 biomaRt_2.52.0 forcats_0.5.1 stringr_1.4.0
## [25] dplyr_1.0.9 purrr_0.3.4 readr_2.1.2 tidyr_1.2.0
## [29] tibble_3.1.7 ggplot2_3.3.6 tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-3 ggsignif_0.6.3 ellipsis_0.3.2
## [4] estimability_1.3 XVector_0.36.0 fs_1.5.2
## [7] rstudioapi_0.13 farver_2.1.0 DT_0.23
## [10] bit64_4.0.5 AnnotationDbi_1.58.0 fansi_1.0.3
## [13] lubridate_1.8.0 xml2_1.3.3 codetools_0.2-18
## [16] splines_4.2.0 cachem_1.0.6 knitr_1.39
## [19] jsonlite_1.8.0 dbplyr_2.1.1 png_0.1-7
## [22] compiler_4.2.0 httr_1.4.3 backports_1.4.1
## [25] Matrix_1.4-1 assertthat_0.2.1 fastmap_1.1.0
## [28] lazyeval_0.2.2 cli_3.3.0 htmltools_0.5.2
## [31] prettyunits_1.1.1 tools_4.2.0 gtable_0.3.0
## [34] glue_1.6.2 GenomeInfoDbData_1.2.8 rappdirs_0.3.3
## [37] Rcpp_1.0.8.3 carData_3.0-5 Biobase_2.56.0
## [40] cellranger_1.1.0 jquerylib_0.1.4 vctrs_0.4.1
## [43] Biostrings_2.64.0 crosstalk_1.2.0 xfun_0.31
## [46] rvest_1.0.2 lifecycle_1.0.1 XML_3.99-0.9
## [49] zoo_1.8-10 zlibbioc_1.42.0 vroom_1.5.7
## [52] hms_1.1.1 parallel_4.2.0 sandwich_3.0-1
## [55] yaml_2.3.5 curl_4.3.2 memoise_2.0.1
## [58] sass_0.4.1 stringi_1.7.6 RSQLite_2.2.14
## [61] highr_0.9 S4Vectors_0.34.0 BiocGenerics_0.42.0
## [64] filelock_1.0.2 GenomeInfoDb_1.32.2 rlang_1.0.2
## [67] pkgconfig_2.0.3 bitops_1.0-7 evaluate_0.15
## [70] lattice_0.20-45 labeling_0.4.2 htmlwidgets_1.5.4
## [73] bit_4.0.4 tidyselect_1.1.2 magrittr_2.0.3
## [76] R6_2.5.1 IRanges_2.30.0 generics_0.1.2
## [79] DBI_1.1.2 mgcv_1.8-40 pillar_1.7.0
## [82] haven_2.5.0 withr_2.5.0 KEGGREST_1.36.0
## [85] abind_1.4-5 modelr_0.1.8 crayon_1.5.1
## [88] car_3.0-13 utf8_1.2.2 BiocFileCache_2.4.0
## [91] tzdb_0.3.0 rmarkdown_2.14 progress_1.2.2
## [94] grid_4.2.0 readxl_1.4.0 data.table_1.14.2
## [97] blob_1.2.3 reprex_2.0.1 digest_0.6.29
## [100] xtable_1.8-4 stats4_4.2.0 munsell_0.5.0
## [103] beeswarm_0.4.0 viridisLite_0.4.0 vipor_0.4.5
## [106] bslib_0.3.1